Quantifying Tackling in Football

A Data-Driven Approach Using the NFL Big Data Bowl Dataset and Advanced Machine Learning Techniques

Dusty Turner

A Quick Reminder

Research Hypothesis


Research Question: Can we estimate, for each defensive player on the field, the probability that he makes a tackle on any given play?

Ultimately: Assign a ‘tackles over expected’ value for each player.

Literature Review

Previous NFL Big Data Bowl Competitions

  • 2020: How many yards will an NFL player gain after receiving a handoff?
  • 2021: Evaluate defensive performance on passing plays
  • 2022: Evaluate special teams performance
  • 2023: Evaluate linemen on pass plays

Data

Player & Game Identifiers

  • Game and Play IDs: Unique identifiers for games and individual plays
  • Player Information: Names, jersey numbers, team, position, physical attributes, college

In-Game Player Movements

  • Spatial Data: Player positions, movement direction, speed, and orientation
  • Time and Motion: Specific moments in play, distance covered

Detailed Play Information

  • Play Attributes: Description, quarter, down, yards needed
  • Team & Field Position: Possessing team, defensive team, yardline positions

Scoring and Game Probabilities

  • Scores & Results: Pre-snap scores, play outcomes
  • Probabilities: Win probabilities for home and visitor teams
  • Expected Points: Expected points and expected points added for each play outcome

Tackles, Penalties, and Formations

  • Tackles & Fouls: Tackles, assists, fouls committed, and missed tackles
  • Ball Carrier Info: Identifiers and names of ball carriers
  • Team Formations: Offensive formations and number of defenders

Feature Development


Modeling Overview

Rows: 393,536
Technique: Group Splitting (Game ID / Play ID)
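Group splitting keeps every row from the same play on one side of a split, so frames from a play never leak between train and test. A minimal sketch with scikit-learn's GroupShuffleSplit (the column names and synthetic data are illustrative, not the Big Data Bowl schema):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Each row is one tracking frame; rows sharing a gameId/playId key must stay together.
rng = np.random.default_rng(0)
game_play = rng.integers(0, 50, size=1000)  # stand-in for combined Game ID / Play ID keys
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 2, size=1000)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=game_play))

# No play appears on both sides of the split.
assert set(game_play[train_idx]).isdisjoint(set(game_play[test_idx]))
```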

Factors to Consider:

  • Tackle (0/1)
  • Future x/y position of the defender
  • Speed, acceleration, orientation, and direction of the defender
  • Position × alignment-cluster interaction
  • Number of defenders in the box
  • Current and future (0.5 seconds ahead) location of the ball
  • Orientation, speed, acceleration, and direction of the ball carrier
  • Velocity/direction difference between defender and ball carrier
  • Whether the ball is in the defensive player’s ‘fan’
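The ‘fan’ feature is not defined on the slide; a minimal sketch, assuming it flags whether the ball (or ball carrier) lies within a cone around the defender's facing direction. The 30° half-angle is a placeholder, and angles follow the tracking data's degrees-clockwise-from-the-+y-axis convention:

```python
import math

def in_fan(def_x, def_y, def_dir_deg, ball_x, ball_y, half_angle_deg=30.0):
    """Is the ball inside the defender's forward 'fan' (a cone of half_angle_deg)?

    def_dir_deg is assumed to be degrees clockwise from the +y axis
    (0 = facing straight 'up' the field), matching the tracking data.
    """
    rad = math.radians(def_dir_deg)
    face = (math.sin(rad), math.cos(rad))        # unit vector of facing direction
    dx, dy = ball_x - def_x, ball_y - def_y      # vector from defender to ball
    dist = math.hypot(dx, dy)
    if dist == 0:
        return True
    cos_angle = (face[0] * dx + face[1] * dy) / dist
    return cos_angle >= math.cos(math.radians(half_angle_deg))

# Defender at the origin facing straight upfield; ball 5 yards directly ahead.
assert in_fan(0, 0, 0, 0, 5)
# A ball directly behind the defender falls outside the fan.
assert not in_fan(0, 0, 0, 0, -5)
```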

Concerns:

  • Computational time
    • Limits tuning parameter options
    • Impacts choices for train/test/validation splits
  • Different coding languages

Modeling Overview

  • Penalized Regression: {glmnet}
  • Random Forest: {ranger}
  • XGBoost: {xgboost}
    • Train: 19,426 rows (5%)
    • Validate: 4,114 rows (1%)
    • Test: 369,996 rows (96%)
    • Baseline Accuracy: 92.9%
  • Neural Network: {reticulate} (Python TensorFlow)
    • Train: 184,888 rows (57%)
    • Validate: 46,222 rows (15%)
    • Test: 90,090 rows (28%)
    • Baseline Accuracy: 92.94%

Penalized Regression

\[\text{Minimize } \left\{ \frac{1}{N} \sum_{i=1}^{N} (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \lambda \left[ \frac{1 - \alpha}{2} \|\boldsymbol{\beta}\|_2^2 + \alpha \|\boldsymbol{\beta}\|_1 \right] \right\}\]



  • Tuning Parameters
    • Alpha
    • Lambda
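The objective shown is glmnet's elastic-net penalty in its Gaussian form; for a 0/1 tackle outcome, glmnet fits a binomial model with the same penalty. A rough sketch on synthetic data using scikit-learn's LogisticRegression as a stand-in, where glmnet's alpha maps to l1_ratio and lambda maps (only approximately) to 1/C:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: two informative features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

alpha, lam = 0.67, 0.0001          # placeholder values in glmnet's parameterization
model = LogisticRegression(
    penalty="elasticnet", solver="saga",   # saga is required for the elastic-net penalty
    l1_ratio=alpha, C=1.0 / (lam * len(y)), max_iter=5000,
)
model.fit(X, y)
probs = model.predict_proba(X)[:, 1]  # per-row tackle probabilities
```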

Penalized Regression


The best parameters are:

Lambda = 0.00011
Alpha = 0.6723358

Accuracy of 92.44%.

Random Forest



  • Tuning Parameters
    • Mtry
    • Min_n
    • Trees
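The {ranger} tuning knobs map roughly onto scikit-learn's names: mtry → max_features, min_n → min_samples_leaf, trees → n_estimators. A hedged sketch on synthetic data using the tuned values reported later:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# mtry = 7, min_n = 6, trees = 278, mapped onto scikit-learn's parameter names.
rf = RandomForestClassifier(
    n_estimators=278, max_features=7, min_samples_leaf=6, random_state=0
)
rf.fit(X, y)
probs = rf.predict_proba(X)[:, 1]  # per-row tackle probabilities
```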

Random Forest


The best parameters are:

Mtry = 7
Min_n = 6
Trees = 278

Accuracy of 92.87%.

XGBoost



  • Tuning Parameters
    • Trees
    • Min_n
    • Tree Depth
    • Learning Rate
    • Loss Reduction
    • Sample Size %
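As a sketch of how these knobs map onto a gradient-boosting API, here is scikit-learn's GradientBoostingClassifier standing in for XGBoost on synthetic data (trees → n_estimators, min_n → min_samples_leaf, tree depth → max_depth, sample size → subsample; loss reduction, XGBoost's gamma, has no direct analogue here and is omitted):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Tuned values reported later in the deck, mapped onto scikit-learn's names.
gb = GradientBoostingClassifier(
    n_estimators=219, min_samples_leaf=9, max_depth=1,
    learning_rate=1.2, subsample=1.0, random_state=0,
)
gb.fit(X, y)
probs = gb.predict_proba(X)[:, 1]  # per-row tackle probabilities
```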

XGBoost


The best parameters are:

Trees = 219
Min_n = 9
Tree Depth = 1
Learn Rate = 1.2
Loss Reduction = 24
Sample Size = 1

Accuracy of 92.87%.

Neural Network

# Keras imports required by the model definition
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout
from tensorflow.keras.regularizers import l2

def build_model(input_shape):
    model = Sequential([
        Dense(64, activation='relu', input_shape=[input_shape], kernel_regularizer=l2(0.001)),
        BatchNormalization(),  # normalizes layer inputs to stabilize and accelerate training
        Dropout(0.3),          # randomly deactivates neurons to prevent overfitting
        Dense(64, activation='relu', kernel_regularizer=l2(0.001)),
        BatchNormalization(),
        Dropout(0.3),
        Dense(64, activation='relu', kernel_regularizer=l2(0.001)),
        BatchNormalization(),
        Dropout(0.3),
        Dense(1, activation='sigmoid', kernel_regularizer=l2(0.001))  # sigmoid output: tackle probability
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model


Accuracy: 92.92%

Tackles Above or Below Expected




\(\sum_{i=1}^{N} (\mathbb{I}_{\text{tackle}_i} - P(\text{tackle}_i))\)

Where:

  1. \(N\) is the total number of plays
  2. \(P(\text{tackle}_i)\) is the probability of a tackle on play \(i\)
  3. \(\mathbb{I}_{\text{tackle}_i}\) is the indicator function which is 1 if a tackle occurred on play \(i\) and 0 otherwise
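The statistic above can be computed per player with a groupby-sum over per-play rows. A minimal sketch on toy data (column names are illustrative, not the dataset's schema):

```python
import pandas as pd

# Toy per-play rows: model probability and whether a tackle actually occurred.
df = pd.DataFrame({
    "displayName": ["A", "A", "B", "B", "B"],
    "tackle":      [1,   0,   1,   1,   0],
    "p_tackle":    [0.6, 0.3, 0.2, 0.5, 0.4],
})

# TOE: sum of (indicator - probability) over each player's plays.
toe = (df["tackle"] - df["p_tackle"]).groupby(df["displayName"]).sum()
```

Player A made one tackle against 0.9 expected tackles, so A's TOE is +0.1; positive values mean more tackles than the model expected.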

Tackles Above or Below Expected

Penalized Regression (Accuracy: 92.68%)

  Display Name        TOE     Position
  Talanoa Hufanga      6.00   SS
  Jonathan Owens       4.35   FS
  Cameron Jordan       4.24   DE
  Kevin Byard          4.14   FS
  Dre Greenlaw        −3.22   ILB
  Willie Gay          −3.26   OLB
  Cody Barton         −3.98   MLB
  Demario Davis       −4.56   MLB

Random Forest (Accuracy: 92.87%)

  Display Name        TOE     Position
  Talanoa Hufanga      4.92   SS
  Maxx Crosby          3.89   DE
  Jonathan Owens       3.87   FS
  Cameron Jordan       3.79   DE
  Xavier McKinney     −2.80   FS
  Demario Davis       −2.91   MLB
  Damien Wilson       −3.15   MLB
  Cody Barton         −3.76   MLB

Extreme Gradient Boosting (Accuracy: 92.44%)

  Display Name        TOE     Position
  Talanoa Hufanga      6.12   SS
  Jonathan Owens       4.58   FS
  Maxx Crosby          4.16   DE
  Cameron Jordan       4.03   DE
  Damien Wilson       −3.66   MLB
  Christian Kirksey   −3.97   OLB
  Cody Barton         −5.08   MLB
  Demario Davis       −5.57   MLB

Neural Net (Accuracy: 92.92%)

  Display Name        TOE     Position
  Jonathan Owens       6.10   FS
  C.J. Mosley          3.74   ILB
  Jihad Ward           3.69   OLB
  Grover Stewart       3.45   DT
  Marcus Davenport    −1.04   DE
  Steven Nelson       −1.13   CB
  Marshon Lattimore   −1.14   CB
  Julian Love         −1.31   SS